Chris Pollett > Old Classses >
CS267
( Print View )

Student Corner:
  [Grades Sec1]

  [Submit Sec1]

  [Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Description]
  [Course Outcomes]
  [Outcomes Matrix]
  [Course Schedule]
  [Grading]
  [Requirements/HW/Quizzes]
  [Class Protocols]
  [Exam Info]
  [Regrades]
  [University Policies]
  [Announcements]

HW Assignments:
  [Hw1] [Hw2] [Hw3]
  [Hw4] [Hw5] [Quizzes]

Practice Exams:
  [Midterm] [Final]

HW#2 --- last modified February 05 2019 20:15:58..

Solution set.

Due date: Mar 16

Files to be submitted:
Hw2.zip

Purpose:

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO1 -- Code a basic inverted index capable of performing conjunctive queries.

CLO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

The homework consists of three parts: Some written problems, a coding exercise, and an evaluations task. Materials for all three parts should be zipped together and submitted as Hw2.zip. If you are working in a team, have a readme.txt listing your team mates and their student ids. Below are the written problems I'd like you to do:

Use induction on the logical complexity (the number of connectives) of a positive Boolean query to prove the correctness of the nextSolution(Q, position) algorithm we had in class to find the next document that satisfies a positive Boolean query.
Do Exercise 2.3 out of the book.
Do Exercise 2.7 out of the book.

For the coding part of your homework I want you to expand upon the code you wrote for HW1. Now your program will be run from the command-line with a command like:

php search_program.php some_dir query ranking_method tokenization_method

some_dir is, as in the last assignment, a folder containing plain text file documents. We will assume these documents are written in American English. query is a one or more term search query (if more than one term put the whole query in quotes). ranking_method should be one of cosine or proximity -- this will determine how you are going to rank the results you return. Finally, tokenization_method should be one of none, stem, or chargram.

On a reasonable set of command line inputs, your program should use glob to get the *.txt file names in the folder some_dir . We will treat the position in the array returned by glob as the document id of a document. Before your program does tokenization, I want you to first convert the text to lower case and then strip punctuation. Then depending on the tokenization_method , I want you to either make tokens using just white space as the delimiter, or to use white space and then apply either an English stemmer or do chargramming. For the stemming or chargramming, I want you to make use of Yioop as a library using Composer. You can have your project depend on version 3.4.0 of Yioop. For chargramming, you can use the function getNGramsTerm where $n is defined as 5. For this homework, your program does not output the inverted index. Instead, I want you to store these informations into a class which has the four public methods first, last, prev, and next, described in class for the inverted index ADT. Your implementation of prev and next should use galloping search. The command-line argument query above can be either a single word for example "bob", it can be a string of words for example "bob smith". If it is more that one term, when I run your program I will put the sequence of terms in double-quotes, so that it will come in to your program as a single command-line argument. As an example, I might run your program with a line like:

php search_program.php ./ "apple records" cosine chargram

On an input as above, your program should return a sequence of lines, each line containing a pair (doc_id, rank). Here doc_id should be a document which contains the word apple and contains the word records (not necessarily adjacent to each other). Here rank should be the cosine or proximity rank score for that document with respect to this query. You should output results in descending order of rank. In computing cosine rank, use TF_IDF values for the vector component where the TF function is the example one given in the book.

For the last part of your assignment, I want you to create a small text collection of 20 documents and put them in a subfolder my_test_files as part of what you submit. Name the files: file1.txt, file2.txt ... Make up a list of six queries (two single word, two string of words, two three words). For each of your queries and each of the documents in my_test_files make a 1 (yes), 0 (no) judgment about the relevance of that document to that query. Each of your queries should have the key words searched on as well as a description of the words you expect to get back. Number your queries and put your queries and their descriptions in queries.txt. For instance, you might have a query as your query number 1, the query "apple records", with a description of the query being the record company of the Beatles. As part of your description, you might say a document about Apple computers which mentions the word records should not be considered a relevant result for this query. In a separate text file, judgments.txt, you should lists the judgments you came up with above. This file should consist of lines like (query_no, doc_no, relevance) which record this information. View a score of 0 for a document as that document not being returned. Now fix the rank mode as either (your choice) proximity or cosine ranking, I want you to compute by hand or by script the recall, precision, and MAP scores of your search program with respect to each query using your judgments in the case where char-gramming, stemming, or nothing is used. Put your calculations into the file scores.txt are have at least one paragraph where you draw conclusions.

Point Breakdown

Written problems (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong)	3pts
Coding Guidelines and Code Quality (how split into functions and classes, etc.)	0.5pt
Code to read in documents and do initial lower-casing and punctuation stripping.	0.5pt
tokenization_method handling implemented using Yioop and Composer	1pt
InvertedIndex ADT implemented as described. For example, next and prev use galloping search.	1pt
Correctly computes the documents for a conjunctive query	0.5pt
Correctly computes the proximity and cosine ranking	1pt
Test files folder, queries.txt, and judgments.txt as described (0.5pt each)	1.5pt
scores.txt and one paragraph conclusion as described	1pt
Total	10pts